121 resultados para clustering

em Deakin Research Online - Australia


Relevância:

20.00% 20.00%

Publicador:

Resumo:

Clustering is a difficult problem especially when we consider the task in the context of a data stream of categorical attributes. In this paper, we propose SCLOPE, a novel algorithm based on CLOPErsquos intuitive observation about cluster histograms. Unlike CLOPE however, our algo- rithm is very fast and operates within the constraints of a data stream environment. In particular, we designed SCLOPE according to the recent CluStream framework. Our evaluation of SCLOPE shows very promising results. It consistently outperforms CLOPE in speed and scalability tests on our data sets while maintaining high cluster purity; it also supports cluster analysis that other algorithms in its class do not.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Clustering is a difficult problem especially when we consider the task in the context of a data stream of categorical attributes. In this paper, we propose σ-SCLOPE, a novel algorithm based on SCLOPE’s intuitive observation about cluster histograms. Unlike SCLOPE however, our algorithm consumes less memory per window and has a better clustering runtime for the same data stream in a given window. This positions σ-SCLOPE as a more attractive option over SCLOPE if a minor lost of clustering accuracy is insignificant in the application.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a new technique to perform unsupervised data classification (clustering) based on density induced metric and non-smooth optimization. Our goal is to automatically recognize multidimensional clusters of non-convex shape. We present a modification of the fuzzy c-means algorithm, which uses the data induced metric, defined with the help of Delaunay triangulation. We detail computation of the distances in such a metric using graph algorithms. To find optimal positions of cluster prototypes we employ the discrete gradient method of non-smooth optimization. The new clustering method is capable to identify non-convex overlapped d-dimensional clusters.


Relevância:

20.00% 20.00%

Publicador:

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper discusses various extensions of the classical within-group sum of squared errors functional, routinely used as the clustering criterion. Fuzzy c-means algorithm is extended to the case when clusters have irregular shapes, by representing the clusters with more than one prototype. The resulting minimization problem is non-convex and non-smooth. A recently developed cutting angle method of global optimization is applied to this difficult problem

Relevância:

20.00% 20.00%

Publicador:

Resumo:

This paper proposes a hyperlink-based web page similarity measurement and two matrix-based hierarchical web page clustering algorithms. The web page similarity measurement incorporates hyperlink transitivity and page importance within the concerned web page space. One clustering algorithm takes cluster overlapping into account, another one does not. These algorithxms do not require predefined similarity thresholds for clustering, and are independent of the page order. The primary evaluations show the effectiveness of the proposed algorithms in clustering improvement.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The rapid increase of web complexity and size makes web searched results far from satisfaction in many cases due to a huge amount of information returned by search engines. How to find intrinsic relationships among the web pages at a higher level to implement efficient web searched information management and retrieval is becoming a challenge problem. In this paper, we propose an approach to measure web page similarity. This approach takes hyperlink transitivity and page importance into consideration. From this new similarity measurement, an effective hierarchical web page clustering algorithm is proposed. The primary evaluations show the effectiveness of the new similarity measurement and the improvement of web page clustering. The proposed page similarity, as well as the matrix-based hyperlink analysis methods, could be applied to other web-based research areas..

Relevância:

20.00% 20.00%

Publicador:

Resumo:

To efficiently and yet accurately cluster Web documents is of great interests to Web users and is a key component of the searching accuracy of a Web search engine. To achieve this, this paper introduces a new approach for the clustering of Web documents, which is called maximal frequent itemset (MFI) approach. Iterative clustering algorithms, such as K-means and expectation-maximization (EM), are sensitive to their initial conditions. MFI approach firstly locates the center points of high density clusters precisely. These center points then are used as initial points for the K-means algorithm. Our experimental results tested on 3 Web document sets show that our MFI approach outperforms the other methods we compared in most cases, particularly in the case of large number of categories in Web document sets.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The human immune system provides inspiration for solving a wide range of innovative problems. In this paper, we propse an immune network based approach for web document clustering. All the immune cells in the network competitively recognize the antigens (web documents) which are presented to the network one by one. The interaction between immune cells and an antigen leads to an augment of the network through the clonal selection and somatic mutation of the stimulated immune cells, while the interaction among immune cells results in a network compression. The structure of the immune network is well maintained by learning and self-regularity. We use a public web document data set to test the effectiveness of our method and compare it with other approaches. The experimental results demonstrate that the most striking advantage of immune-based data clustering is its adaptation in dynamic environment and the capability of finding new clusters automatically.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

We propose a new data induced metric to perform un supervised data classification (clustering). Our goal is to automatically recognize clusters of non-convex shape. We present a new version of fuzzy c-means al gorithm, based on the data induced metric, which is capable to identify non-convex d-dimensional clusters.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

For many clustering algorithms, such as K-Means, EM, and CLOPE, there is usually a requirement to set some parameters. Often, these parameters directly or indirectly control the number of clusters, that is, k, to return. In the presence of different data characteristics and analysis contexts, it is often difficult for the user to estimate the number of clusters in the data set. This is especially true in text collections such as Web documents, images, or biological data. In an effort to improve the effectiveness of clustering, we seek the answer to a fundamental question: How can we effectively estimate the number of clusters in a given data set? We propose an efficient method based on spectra analysis of eigenvalues (not eigenvectors) of the data set as the solution to the above. We first present the relationship between a data set and its underlying spectra with theoretical and experimental results. We then show how our method is capable of suggesting a range of k that is well suited to different analysis contexts. Finally, we conclude with further  empirical results to show how the answer to this fundamental question enhances the clustering process for large text collections.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Clustering of multivariate data is a commonly used technique in ecology, and many approaches to clustering are available. The results from a clustering algorithm are uncertain, but few clustering approaches explicitly acknowledge this uncertainty. One exception is Bayesian mixture modelling, which treats all results probabilistically, and allows comparison of multiple plausible classifications of the same data set. We used this method, implemented in the AutoClass program, to classify catchments (watersheds) in the Murray Darling Basin (MDB), Australia, based on their physiographic characteristics (e.g. slope, rainfall, lithology). The most likely classification found nine classes of catchments. Members of each class were aggregated geographically within the MDB. Rainfall and slope were the two most important variables that defined classes. The second-most likely classification was very similar to the first, but had one fewer class. Increasing the nominal uncertainty of continuous data resulted in a most likely classification with five classes, which were again aggregated geographically. Membership probabilities suggested that a small number of cases could be members of either of two classes. Such cases were located on the edges of groups of catchments that belonged to one class, with a group belonging to the second-most likely class adjacent. A comparison of the Bayesian approach to a distance-based deterministic method showed that the Bayesian mixture model produced solutions that were more spatially cohesive and intuitively appealing. The probabilistic presentation of results from the Bayesian classification allows richer interpretation, including decisions on how to treat cases that are intermediate between two or more classes, and whether to consider more than one classification. The explicit consideration and presentation of uncertainty makes this approach useful for ecological investigations, where both data and expectations are often highly uncertain.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Clustering is widely used in bioinformatics to find gene correlation patterns. Although many algorithms have been proposed, these are usually confronted with difficulties in meeting the requirements of both automation and high quality. In this paper, we propose a novel algorithm for clustering genes from their expression profiles. The unique features of the proposed algorithm are twofold: it takes into consideration global, rather than local, gene correlation information in clustering processes; and it incorporates clustering quality measurement into the clustering processes to implement non-parametric, automatic and global optimal gene clustering. The evaluation on simulated and real gene data sets demonstrates the effectiveness of the algorithm.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Email overload is a recent problem that there is increasingly difficulty people have faced to process the large number of emails received daily. Currently this problem becomes more and more serious and it has already affected the normal usage of email as a knowledge management tool. It has been recognized that categorizing emails into meaningful groups can greatly save cognitive load to process emails and thus this is an effective way to manage email overload problem. However, most current approaches still require significant human input when categorizing emails. In this paper we develop an automatic email clustering system, underpinned by a new nonparametric text clustering algorithm. This system does not require any predefined input parameters and can automatically generate meaningful email clusters. Experiments show our new algorithm outperforms existing text clustering algorithms with higher efficiency in terms of computational time and clustering quality measured by different gauges.